[57] ABSTRACT

The present invention is directed to a system and method for 
monitoring performance in an information handling system 
in a minimally intrusive manner. The method of the present 
invention includes a collection phase, a placement phase, 
and an instrumentation phase. During the collection phase, 
a workload (i.e. code segment) is traced, and instruction and 
data accesses are determined. During the placement phase, 
the trace data is passed to a cache simulator. The cache 
simulator uses the trace data, along with hardware and 
instrumentation characteristics, to determine an optimal
placement for instrumentation code and data. If the desired 
conflict level is not attainable, the best possible placement is 
found by executing the code to be monitored with a variety 
of instrumentation code and data placements until the least 
intrusive placement is found. The best possible placement is 
then used during the instrumentation phase to actually 
execute the instrumented code. 

12 Claims, 5 Drawing Sheets
CODE INSTRUMENTATION SYSTEM WITH
NON INTRUSIVE MEANS AND CACHE
MEMORY OPTIMIZATION FOR DYNAMIC
MONITORING OF CODE SEGMENTS

FIELD OF THE INVENTION

The present invention relates to information processing 
systems and, more particularly, to software tools and meth- 
ods for monitoring, modeling, and enhancing system per- 
formance. 

BACKGROUND OF THE INVENTION 

To enhance system performance, it is helpful to know 
which modules within a system are the most frequently 
executed. These most frequently executed modules are 
referred to as "hot" modules. Within these hot modules, it is 
also useful to know which lines of code are the most 
frequently executed. These frequently executed code seg- 
ments are known as "hot spots." 

A programmer hoping to improve system performance 
should focus his or her efforts on improving the performance 
of the hot modules and hot spots within those modules. 
Improving the performance of the most frequently executed 
modules and code segments will have the most effect on 
improving overall system performance. It does not make 
sense to spend much time improving the performance of 
modules or code segments which are rarely executed, as this 
will have little, if any, effect on the overall system perfor- 
mance. 

Many modern processors contain hardware capability
which allows performance data to be collected. For example, 
most modern processors have the capability to measure 
cycle time. Many modern processors also have the ability to
count other items, such as cache misses, floating point 
operations, bus utilization, and translation look-aside buffer 
(TLB) misses. To count cache misses, for example, a bit or 
a sequence of bits within a control register is set to a 
predetermined code. This bit sequence tells the processor to
increment a counter every time there is a cache miss. When 
the bit sequence is reset, the processor stops counting cache 
misses, and the total number of cache misses can be read 
from another register or from a memory area. 

Once a programmer determines a code segment (i.e. a hot
spot) that needs further performance analysis, the programmer
then "instruments" the code to be tested. For example,
suppose the programmer determines that a particular code
segment, consisting of twenty lines of code, is a hot spot that
needs further performance analysis. The programmer will
put a "hook" (i.e. an instruction or group of instructions) in
front of the twenty instructions. The hook will typically be
a jump instruction, causing execution to jump to an
instrumentation routine. The instrumentation routine will start
some type of performance analysis. For example, the
instrumentation routine may set an appropriate bit or set of bits in
a control register to turn on cache miss counting in the
processor. The instrumentation code then returns control to
the instructions being tested. At the end of the code segment
being tested, the programmer will insert another hook. This
hook typically jumps to an instrumentation routine which
turns off performance testing. In the example given, the
instrumentation routine would set the appropriate bit or bits
in the control register to stop cache miss counting, and then
would store the cache miss count.
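
By way of illustration only, the entry and exit hooks described above can be modeled in software. The following Python sketch simulates a control register and a cache miss counter; the class SimulatedPerfMonitor, the bit value COUNT_CACHE_MISSES, and the helper names are hypothetical and do not correspond to any particular processor.

```python
from contextlib import contextmanager

# Hypothetical bit in the simulated control register; a real
# processor defines its own encoding.
COUNT_CACHE_MISSES = 0x1


class SimulatedPerfMonitor:
    """Toy model of a control register plus one event counter."""

    def __init__(self):
        self.control = 0      # simulated control register
        self.miss_count = 0   # simulated counter register

    def record_cache_miss(self):
        # The simulated hardware counts only while the bit is set.
        if self.control & COUNT_CACHE_MISSES:
            self.miss_count += 1


@contextmanager
def instrumented(monitor, results):
    # Entry hook: clear the counter and set the control bit.
    monitor.miss_count = 0
    monitor.control |= COUNT_CACHE_MISSES
    try:
        yield
    finally:
        # Exit hook: clear the bit, then store the count.
        monitor.control &= ~COUNT_CACHE_MISSES
        results.append(monitor.miss_count)


def demo():
    monitor = SimulatedPerfMonitor()
    results = []
    monitor.record_cache_miss()          # before the entry hook: not counted
    with instrumented(monitor, results):
        for _ in range(3):               # stand-in for the code under test
            monitor.record_cache_miss()
    monitor.record_cache_miss()          # after the exit hook: not counted
    return results
```

In this model only the three misses that occur between the two hooks are stored. The sketch does not model the cache footprint of the hooks themselves.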

One problem with this type of instrumentation is that the
instrumentation routines may affect the performance results
of the code being analyzed. For example, if any of the
instructions in the instrumentation routines are in the same
cache congruency class as an instruction in the code being
tested, an instrumentation instruction could cause a tested
instruction to be forced out of the instruction cache. This
would affect the cache hit/miss ratio and the cycles per
instruction (CPI) measurement for the code being tested.
Similar problems could occur with data cache measurements
if any data accesses by the instrumentation routine forced
data out of the data cache. Similar problems could also occur
with other types of measurements, such as translation
look-aside buffer (TLB) measurements.

Consequently, it would be desirable to have a minimally
intrusive system and method for measuring performance in
an information handling system. It would be desirable if the
system and method greatly decreased the chance of
instrumentation code or data impacting the performance
measurements of tested code.

SUMMARY OF THE INVENTION 

Accordingly, the present invention is directed to a system
and method for monitoring performance in an information
handling system in a minimally intrusive manner. The
method of the present invention includes a collection phase,
a placement phase, and an instrumentation phase. During the
collection phase, a workload (i.e. code segment) is traced,
and instruction and data accesses are determined. During the
placement phase, the trace data is passed to a cache
simulator. The cache simulator uses the trace data, along with
hardware and instrumentation characteristics, to determine
an optimal placement for instrumentation code and data. If
the desired conflict level is not attainable, the best possible
placement is found by executing the code to be monitored
with a variety of instrumentation code and data placements
until the least intrusive placement is found. The best possible
placement is then used during the instrumentation phase to
actually execute the instrumented code.

One embodiment of the present invention is an information
handling system capable of performing the method
described above. Another embodiment of the present
invention is implemented as sets of instructions resident in an
information handling system.

One advantage of the present invention is that it allows 
performance monitoring of code segments with minimal 
intrusion. Another advantage of the present invention is that 
it decreases the chance of instrumentation code or data 
impacting the performance measurements of code being 
tested. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing and other features and advantages of the 
present invention will become more apparent from the 
detailed description of the best mode for carrying out the 
invention as rendered below. In the description to follow, 
reference will be made to the accompanying drawings, 
where like reference numerals are used to identify like parts 
in the various views and in which: 

FIG. 1 is a block diagram of an information handling
system capable of executing the performance monitoring 
method of the present invention; 

FIGS. 2A and 2B are block diagrams depicting portions of 
RAM and cache memory in the system of FIG. 1; 

FIG. 3 is a flow chart depicting the multiple phases of the 
present invention; 
FIG. 4 is a flow chart depicting further details of a method
for adding minimally intrusive code and data according to 
the teachings of the present invention; and 
FIG. 5 is a flow chart depicting further details of a method
for determining the best possible placement of instrumentation
code and data according to the teachings of the present
invention.

DETAILED DESCRIPTION OF A PREFERRED
EMBODIMENT OF THE INVENTION

The invention may be implemented on a variety of
hardware platforms, including personal computers,
workstations, mini-computers, and mainframe computers.
Many of the steps of the method of the present invention
may be advantageously implemented on parallel processors
of various types. Referring now to FIG. 1, a typical
configuration of an information handling system that may be
used to practice the novel method of the present invention
will be described. The computer system of FIG. 1 has at least
one processor 10. Processor 10 is interconnected via system
bus 12 to random access memory (RAM) 16, read only
memory (ROM) 14, and input/output (I/O) adapter 18 for
connecting peripheral devices such as disk units 20, tape
drives 40, and printers 42 to bus 12, user interface adapter
22 for connecting keyboard 24, mouse 26 having buttons
17a and 17b, speaker 28, microphone 32, and/or other user
interface devices such as a touch screen device 29 to bus 12,
communication adapter 34 for connecting the information
handling system to a data processing network, and display
adapter 36 for connecting bus 12 to display device 38. Each
processor 10 includes a level one cache memory 39.
Additional levels of cache memory may be present in processor
10 or connected to bus 12. Communication adapter 34 may
link the system depicted in FIG. 1 with hundreds or even
thousands of similar systems, or other devices, such as
remote printers, remote servers, or remote storage units.

It is often desirable to collect the intrinsic data of a certain
workload, or code segment, to help identify performance
problems within an information handling system. Intrinsic
data is data that describes inherent characteristics of the
code. For example, a code segment may be written in such
a way that it will always cause a cache miss. The cache miss
is an inherent characteristic of the code segment. Alternately,
a code segment may be written such that it will always cause
an interrupt which then causes a cache miss. This type of
cache miss is also an inherent characteristic of the code
segment. Intrinsic data can be used as input to system design
for both hardware and software. For example, data such as
the cache miss rate is vital in determining hardware cache
geometry, whereas instruction count data is critical for use
in compiler optimization.

Two kinds of intrinsic data can be collected, deterministic
data and non-deterministic data. Deterministic data includes
such things as data access and instruction execution
sequences. Usually, external factors, such as interrupts, do
not disturb the collection of deterministic data.
Non-deterministic data includes such things as cache hit/miss
ratios and TLB hit/miss ratios, and is much more sensitive
to external factors, such as interrupts. The collection and use
of non-deterministic data can be inaccurate and misleading,
especially when measured over a relatively small workload.

The present invention describes a system and method for
collecting intrinsic data of a certain workload, or code
segment, while minimizing external factors that affect the
validity of non-deterministic data. In particular, a
multi-phased method is used to collect instruction and data
statistics while minimizing the side effects of the
instrumentation code itself.

The present invention may be used to collect data for any
size workload, but is particularly useful for instrumenting
code segments of approximately 10 instructions up to
approximately 200 instructions. For large code segments
(i.e. more than 200 instructions), the overhead associated
with overwriting instructions or data in the cache, and then
moving the same items back into the cache, is a small
percentage of total execution time, and thus has a minimal
impact on performance measurements. For small code
segments (i.e. fewer than ten instructions), the instrumentation
is so tight that it is virtually impossible to get an accurate
picture of cache usage.

For illustrative purposes, the present invention will be
described with reference to collecting instruction and data
caching statistics for a code segment. One of the concerns
with collecting instruction and data cache information is that
the act of collecting the information will affect the
measurements. This is because instrumentation code or data may
inadvertently be in the same cache congruency class as the
code segment instructions or data. A cache congruency class
is a class of addresses that have the same mapping into
cache.

For example, suppose that a particular piece of data for
the code segment being tested can be put into one of four
places in the data cache. These four places in the data cache
form a cache congruency class. Once all four
places are filled with data, a least recently used algorithm is
used to determine which piece of data is overwritten the next
time a new piece of data needs to be written into the data
cache. If instrumentation data is in the same cache
congruency class, it is possible that a piece of instrumentation data
will overwrite a piece of data that would not normally be
overwritten. This will cause a cache miss to occur, which
would not normally occur. A similar situation can occur with
instructions mapping into the instruction cache. It is possible
that instructions from the code being tested could map to the
same place in the instruction cache as instrumentation
instructions. The system and method of the present invention
minimizes these types of mismeasurements by ensuring that
the instrumentation code and instrumentation data do not
intrude upon the code segment being tested. Of course, one
skilled in the art will understand that there will be other
optimization advantages associated with the present
invention.

The problems associated with collecting cache miss data
are illustrated pictorially in FIGS. 2A and 2B. Code under
test (CUT) instructions 44 and CUT data 46 are loaded into
RAM 16. CUT instructions 44 map into instruction cache 48
at cache lines 49, 50, and 51. CUT data 46 map into data
cache 53 at cache lines 54, 55, and 56. Now suppose that
instrumentation instructions and data are loaded into RAM
16 as depicted in FIG. 2B. Instrumentation instructions 60
are loaded before and after CUT instructions 44.
Instrumentation data 61 is loaded before CUT data 46. At first glance,
it appears that the instrumentation instructions and data do
not interfere with the code segment under test. However, a
closer look at the instruction and data caches shows that the
cache hit/miss ratio will be affected by the instrumentation
instructions and data. Instrumentation instructions 60 map
into instruction cache 48 at locations 62 and 63, partly
overwriting cache lines 49 and 51. Similarly,
instrumentation data 61 maps into data cache 53 at locations 64 and 65.
The mapping to location 64 does not affect CUT data 46.
However, the mapping to location 65 partially overwrites
cache line 56, and thus will have an effect on the data cache
hit/miss ratio. The present invention provides a system and
method for minimizing the effect that instrumentation
instructions and data have on the performance
measurements collected.
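
As an illustrative sketch only, the mapping of addresses into cache congruency classes can be expressed in a few lines of Python. The sketch assumes a conventional modulo mapping of cache line numbers to sets, which the present description does not mandate; the names CacheGeometry, congruency_class, classes_touched, and shared_classes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    size_bytes: int   # total cache capacity
    line_bytes: int   # bytes per cache line
    ways: int         # associativity (1 = direct-mapped)

    @property
    def num_sets(self):
        return self.size_bytes // (self.line_bytes * self.ways)


def congruency_class(address, geometry):
    """Return the set index that a byte address maps to (assumed modulo mapping)."""
    return (address // geometry.line_bytes) % geometry.num_sets


def classes_touched(start, length, geometry):
    """Set indices covered by the byte range [start, start + length)."""
    first_line = start // geometry.line_bytes
    last_line = (start + length - 1) // geometry.line_bytes
    return {line % geometry.num_sets for line in range(first_line, last_line + 1)}


def shared_classes(cut_range, instr_range, geometry):
    """Classes used by both the code under test and the instrumentation.

    Each range is a (start_address, length_in_bytes) pair.
    """
    cut = classes_touched(*cut_range, geometry)
    instr = classes_touched(*instr_range, geometry)
    return cut & instr
```

For a direct-mapped cache of 1024 bytes with 32-byte lines, a 96-byte code segment at address 0x1000 occupies classes 0, 1, and 2. Instrumentation placed at 0x1400 shares class 0 with it, whereas instrumentation placed at 0x1460 shares none. In a cache with higher associativity, sharing a class does not by itself force an eviction, so a shared class indicates only a potential conflict.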
In its preferred embodiment, the present invention is
carried out in three phases, as depicted in FIG. 3. The first 
phase is referred to as the collection phase 80. During 
collection phase 80, data is collected regarding the code to 
be instrumented, or monitored. This data is then used during
the second phase, placement phase 82. During placement 
phase 82, a determination is made as to where in memory to 
place instrumentation code segments and data segments so 
as to have the least effect on the actual code segment being 
monitored. Finally, during the third phase, instrumentation 
phase 84, the code to be monitored is executed with the 
instrumentation code segments and data segments in optimal 
memory locations. 

Referring now to FIG. 4, a method is shown for
implementing the three phases depicted in FIG. 3. In FIG. 4, steps
90, 92, and 94 are part of collection phase 80. Steps 96
through 110 are part of placement phase 82. Instrumentation
phase 84 is not depicted in FIG. 4.

As shown in FIG. 4, the workload (i.e. code segment to be
monitored) is executed concurrently with a tracing program
(step 90). The tracing program collects both instruction
cache accesses and data cache accesses for the code segment
to be monitored (step 92). This data collection step is highly
intrusive in terms of instruction and data caching, but it does
not affect the deterministic data (i.e. which cache and data
lines are being used by the code segment to be monitored)
being collected by the tracing program. The data is then
stored for use by a cache simulator (step 94).

A cache simulator is a program that takes trace data and
cache geometry as input, and outputs data such as miss rate
and hot cache lines. Such a program can be as complex as
the actual hardware implementation of the cache or can use
heuristic algorithms to obtain estimated results. For
example, suppose a developer wishes to obtain the cache
miss rate for an n-by-m way set-associative instruction
cache. A trace of executed instruction addresses is needed,
along with the length of each variable-length instruction.
The cache simulator fills in data structures per input, and
adjusts the output counters according to state changes. If, for
example, the next instruction address maps to a data
structure that is already filled by a previous instruction, a cache
miss counter is incremented and a cache cast-out event
occurs. At the end of the simulation, counter data is output,
as well as a map of the current data structures representing
the cache content. Cache simulators are used in the art to
experiment with and test different caching algorithms (e.g.
least recently used, first fit, etc.), decide on certain cache
geometries, and predict cache performance.
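
A minimal sketch of such a simulator is given below, assuming a set-associative cache with least recently used replacement and a trace consisting of plain byte addresses. The function name simulate_cache and its parameters are hypothetical. Unlike the simplified description above, the sketch also counts the first access to an empty cache line as a miss.

```python
from collections import Counter, OrderedDict


def simulate_cache(trace, size_bytes, line_bytes, ways):
    """Replay a list of byte addresses through an LRU set-associative cache.

    Returns (miss_rate, misses_per_set), where misses_per_set is a
    Counter keyed by set index (a rough view of "hot" cache lines).
    """
    num_sets = size_bytes // (line_bytes * ways)
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    misses_per_set = Counter()

    for address in trace:
        line = address // line_bytes
        index = line % num_sets
        tag = line // num_sets
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)       # hit: mark as most recently used
            continue
        misses += 1
        misses_per_set[index] += 1
        if len(entries) >= ways:
            entries.popitem(last=False)    # cast out the least recently used line
        entries[tag] = True

    miss_rate = misses / len(trace) if trace else 0.0
    return miss_rate, misses_per_set
```

For example, replaying the addresses 0, 64, 0, 64 through a 64-byte direct-mapped cache with 16-byte lines yields a miss on every access, because both addresses fall in the same set. The same trace through a two-way cache of the same size misses only on the first two accesses.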

Still referring to FIG. 4, the trace data is then passed to the
cache simulator (step 96). The cache geometry and
instrumentation code and data segments are also passed to the
cache simulator (step 98). The cache geometry includes data
such as the size of the cache, associativity (direct-mapped,
two-way associative, four-way associative, etc.), the size of
the cache lines, and other data. A cache simulator uses trace
data and cache geometry to predict where code and data will
be placed in the cache when the code segment runs. The
cache simulator then executes (step 100), thus determining
a possible placement for the instrumentation code and data
segments which will minimize cache mapping conflicts. The
cache simulator then checks to determine if the cache
mapping conflicts have been minimized (step 102). In other
words, the cache simulator determines if the cache mapping
conflicts are greater than the minimum acceptable conflicts.
The minimum acceptable conflicts may be zero, or some
other desired conflict level set by a user. If cache mapping
conflicts have been minimized, the placement phase is
complete (step 108). If not, the cache simulator checks to see
if other arrangements are possible (step 104). If other
arrangements are possible, the instrumentation code and
data segments are rearranged (step 106) and checked again
(step 102). Steps 102 through 106 are repeated until cache
mapping conflicts are minimized. Once conflicts are
minimized, the instrumentation code can be executed along
with the code segment to be monitored, and cache miss rates
are collected for the workload.
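
The loop of steps 100 through 108 may be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: a placement is modeled as a starting cache line number, a conflict is counted as a set shared with the code under test, and a placement is accepted when its conflict count does not exceed max_conflicts. The function name choose_placement and all parameter names are hypothetical.

```python
def choose_placement(cut_sets, candidates, instr_lines, num_sets, max_conflicts=0):
    """Search candidate placements for one that meets the conflict target.

    cut_sets:     set indices used by the code under test (from the trace)
    candidates:   candidate starting cache line numbers for the instrumentation
    instr_lines:  number of consecutive cache lines the instrumentation occupies
    Returns (placement, conflicts, met_target).
    """
    best = None
    for start in candidates:                      # steps 104/106: next arrangement
        used = {(start + i) % num_sets for i in range(instr_lines)}
        conflicts = len(used & cut_sets)          # step 100: simulated mapping
        if best is None or conflicts < best[1]:
            best = (start, conflicts)
        if conflicts <= max_conflicts:            # step 102: acceptable?
            return start, conflicts, True         # step 108: placement complete
    # No arrangement met the target; the caller falls back to timed runs.
    if best is None:
        return None, None, False
    return best[0], best[1], False
```

When no candidate meets the target, the sketch returns the least conflicting candidate with met_target set to False, which corresponds to the case in which no other arrangements are possible at step 104.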

There is always the possibility that cache mapping
conflicts cannot be minimized to the desired level by the cache
simulator. After trying all possible arrangements, it may be
determined (in step 104) that no other arrangements are
possible. In this case, the method depicted in FIG. 5 (step
110) is used to determine the best possible placement of
instrumentation code and data.

As shown in FIG. 5, a list of the least intrusive placements
(i.e. the placements with the least conflicts) for the
instrumentation code and data is obtained from the cache
simulator (step 120). The instrumentation code and data are then
placed into memory areas according to the list received (step
122). The code under test is then executed (step 124), and the
execution time is compared to any previous execution times
(step 126). If the execution time is less than any previous
execution times, the new execution time is saved (step 128).
If there are more memory areas to try (step 130), the
instrumentation code and data are moved to the next
memory locations (step 132), and the code under test is
again executed (step 124). Steps 124 through 132 are
repeated until there are no more locations to try (step 130).
The placement which resulted in the lowest execution time
is assumed to be the optimal placement for the
instrumentation code and data.
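
The timed search of FIG. 5 may be sketched as follows. The function pick_fastest_placement and the callable run_with_placement are hypothetical; run_with_placement stands in for whatever mechanism loads the instrumentation at a given placement and executes the code under test. The timer is passed as a parameter so that the sketch can be exercised with a deterministic clock.

```python
import time


def pick_fastest_placement(placements, run_with_placement, timer=time.perf_counter):
    """Run the code under test once per placement and keep the fastest.

    Returns (best_placement, best_time), or (None, None) for an empty list.
    """
    best_placement = None
    best_time = None
    for placement in placements:                        # steps 122 and 132
        start = timer()
        run_with_placement(placement)                   # step 124
        elapsed = timer() - start
        if best_time is None or elapsed < best_time:    # step 126
            best_placement, best_time = placement, elapsed   # step 128
    return best_placement, best_time
```

A single timed run per placement is sensitive to external factors such as interrupts, so in practice each placement might be run several times; that refinement is not part of the description above and is omitted here.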

Although the invention has been described with a certain 
degree of particularity, it should be recognized that elements 
thereof may be altered by persons skilled in the art without 
departing from the spirit and scope of the invention. One of 
the embodiments of the invention can be implemented as 
sets of instructions resident in the random access memory 16 
of one or more computer systems configured generally as 
described in FIG. 1. Until required by the computer system, 
the set of instructions may be stored in another computer 
readable memory, for example in a hard disk drive, or in a 
removable memory such as an optical disk for eventual use 
in a CD-ROM drive or a floppy disk for eventual use in a 
floppy disk drive. Further, the set of instructions can be 
stored in the memory of another computer and transmitted 
over a local area network or a wide area network, such as the 
Internet, when desired by the user. One skilled in the art 
would appreciate that the physical storage of the sets of 
instructions physically changes the medium upon which it is 
stored electrically, magnetically, or chemically so that the 
medium carries computer readable information. The inven- 
tion is limited only by the following claims and their 
equivalents. 

What is claimed is: 

1. A method for dynamically monitoring performance of 
a code segment executing in an information handling 
system, said method comprising the steps of: 

(a) collecting data regarding the code segment to be 
monitored; 

(b) selecting a final memory placement for one or more 
instrumentation code segments and one or more instru- 
mentation data segments, wherein the final memory 
placement includes one or more memory areas in which 
to place the instrumentation code segments and the 
instrumentation data segments, and wherein said selecting
step includes the steps of:

(c) choosing a possible memory placement for the
instrumentation code segments and the instrumentation
data segments;

(d) determining an effect of one or more external 
factors based on the possible memory placement; 

(e) if the effect is below a predetermined acceptable 
level, then setting the final memory placement to the 
possible memory segment; 

(f) if the effect is not below the predetermined accept- 
able level, then determining if there are one or more 
additional possible memory placements; 

(g) if there are additional possible memory placements, 
then repeating steps (c) through (g) until the effect is 
below the predetermined acceptable level; 

(h) if there are not additional possible memory 
placements, then performing the following steps: 
(i) placing the instrumentation code segments and
the instrumentation data segments into one of the
possible memory placements determined in step
(c);

(j) executing the code segment to be monitored; 

(k) calculating an execution time;

(l) repeating steps (i) through (k) for each possible
memory placement;
(m) setting the final memory placement equal to the 
possible memory segment which results in a low- 
est execution time; and 
(n) executing the code segment to be monitored, along
with one or more of the instrumentation code segments.

2. A method for dynamically monitoring performance 
according to claim 1, wherein said collecting step further 
comprises the steps of: 

tracing the code segment to be monitored; and 
storing the data for use during said selecting step. 

3. A method for dynamically monitoring performance 
according to claim 1, wherein said step of determining an 
effect of one or more external factors based on the possible 
memory placement comprises the steps of: 

analyzing a current cache geometry; and 
determining a conflict level.

4. A method for dynamically monitoring performance 
according to claim 1, wherein said selecting step is per- 
formed by a cache simulator. 

5. An information handling system, comprising: 

one or more processors, each processor containing a cache
memory;

memory means;

one or more images of an operating system for controlling
the operation of said processors;

at least one system bus connecting the elements of the
system for efficient operation;

means for collecting data regarding a code segment to be
monitored;

means for selecting a final memory placement for one or 
more instrumentation code segments and one or more 
instrumentation data segments, wherein the final 
memory placement includes one or more memory areas 
in which to place the instrumentation code segments 
and the instrumentation data segments, and wherein 
said means for selecting includes: 
means for choosing a possible memory placement for
the instrumentation code segments and the
instrumentation data segments;

means for determining an effect of one or more external
factors based on the possible memory placement;

means for setting the final memory placement to the 
possible memory placement if the effect is below a 
predetermined acceptable level; 

means for determining if there are one or more addi- 
tional possible memory placements; 

means for repeatedly choosing the possible memory 
placement, and for determining the effect of the 
external factors, until the effect is below the prede- 
termined acceptable level; 

means for executing the code segment to be monitored, 
and determining an execution time, for each of the 
possible memory placements determined by said 
means for choosing; 

means for setting the final memory placement equal to
the possible memory segment which results in the
lowest execution time; and

means for executing the code segment to be monitored,
along with one or more of the instrumentation code
segments.

6. An information handling system according to claim 5, 
wherein said means for collecting further comprises: 

means for tracing the code segment to be monitored; and 
means for storing the data for use by said means for 
selecting. 

7. An information handling system according to claim 5, 
wherein said means for determining an effect of one or more 
external factors based on the possible memory placement 
comprises: 

means for analyzing a current cache geometry; and 
means for determining a conflict level.

8. An information handling system according to claim 5, 
wherein said means for selecting comprises a cache simu- 
lator. 

9. A computer program product, in a computer-usable 
medium, comprising:

means for collecting data regarding a code segment to be 
monitored; 

means for selecting a final memory placement for one or
more instrumentation code segments and one or more
instrumentation data segments, wherein the final
memory placement includes one or more memory areas
in which to place the instrumentation code segments
and the instrumentation data segments, and wherein
said means for selecting includes:

means for choosing a possible memory placement for
the instrumentation code segments and the
instrumentation data segments;

means for determining an effect of one or more external
factors based on the possible memory placement;

means for setting the final memory placement to the
possible memory placement if the effect is below a
predetermined acceptable level;

means for determining if there are one or more
additional possible memory placements;

means for repeatedly choosing the possible memory
placement, and for determining the effect of the
external factors, until the effect is below the
predetermined acceptable level;

means for executing the code segment to be monitored,
and determining an execution time, for each of the
possible memory placements determined by said
means for choosing;

means for setting the final memory placement equal to
the possible memory segment which results in the
lowest execution time; and

means for executing the code segment to be monitored,
along with one or more of the instrumentation code
segments.

10. A computer program product according to claim 9, 
wherein said means for collecting further comprises:
means for tracing the code segment to be monitored; and 
means for storing the data for use by said means for 
selecting. 
11. A computer program product according to claim 9,
wherein said means for determining an effect of one or more 
external factors based on the possible memory placement 
comprises: 

means for analyzing a current cache geometry; and 

means for determining a conflict level. 

12. A computer program product according to claim 9, 
wherein said means for selecting comprises a cache simu- 
lator program. 